Abstract

In some real world applications, such as spectrometry, functional models
achieve better predictive performances if they work on the derivatives of order
m of their inputs rather than on the original functions. As a consequence, the
use of derivatives is a common practice in functional data analysis, despite a
lack of theoretical guarantees on the asymptotically achievable performances
of a derivative based model. In this paper, we show that a smoothing spline
approach can be used to preprocess multivariate observations obtained by
sampling functions on a discrete and finite sampling grid in a way that leads
to a consistent scheme on the original infinite dimensional functional problem.
This work extends Mas and Pumo (2009) to nonparametric approaches and
incomplete knowledge. To be more precise, the paper tackles two difficulties
in a nonparametric framework: the information loss due to the use of the
derivatives instead of the original functions, and the information loss due to
the fact that the functions are observed through a discrete sampling and are
thus imperfectly known. The use of a smoothing spline based approach
solves these two problems. Finally, the proposed approach is tested on two
real world datasets and is shown experimentally to be a good
solution in the case of noisy functional predictors.
As measurement techniques develop, more and more data
are high dimensional vectors generated by measuring a continuous process
on a discrete sampling grid. Many examples of this type of data can be
found in real world applications, in various fields such as spectrometry, voice
recognition, time series analysis, etc.

Data of this type should not be handled in the same way as standard 
multivariate observations but rather analysed as functional data: each ob- 
servation is a function coming from an input space with infinite dimension, 
sampled on a high resolution sampling grid. This leads to a large number 
of variables, generally more than the number of observations. Moreover, 
functional data are frequently smooth and generate highly correlated vari- 
ables as a consequence. Applied to the obtained high dimensional vectors, 
classical statistical methods (e.g., linear regression, factor analysis) often 
lead to ill-posed problems, especially when a covariance matrix has to be 
inverted (this is the case, e.g., in linear regression, in discriminant analy- 
sis and also in sliced inverse regression). Indeed, the number of observed 
values for each function is generally larger than the number of functions 
itself and these values are often strongly correlated. As a consequence,
when these data are considered as multidimensional vectors, the covariance
matrix is ill-conditioned and leads to unstable and inaccurate solutions
in models where its inverse is required. Thus, these methods cannot
be directly used. During past years, several methods have been adapted to
that particular context and grouped under the generic name of Functional
Data Analysis (FDA) methods. Seminal works focused on linear methods
such as factorial analysis (Deville (1974); Dauxois and Pousse (1976);
Besse and Ramsay (1986); James et al. (2000), among others) and linear
models (Ramsay and Dalzell (1991); Cardot et al. (1999); James and Hastie
(2001)); a comprehensive presentation of linear FDA methods is given in
Ramsay and Silverman (1997, 2002). More recently, nonlinear functional
models have been extensively developed and include generalized linear models
(James (2002); James and Silverman (2005)), kernel nonparametric regression
(Ferraty and Vieu (2006)), Functional Inverse Regression (Ferré and Yao
(2003)), neural networks (Rossi and Conan-Guez (2005); Rossi et al. (2005)),
k-nearest neighbors (Biau et al. (2005); Laloë (2008)) and Support Vector
Machines (SVM) (Rossi and Villa (2006)), among a very large variety of methods.

In previous works, numerous authors have shown that the derivatives
of the functions sometimes lead to better predictive performances than the
functions themselves in inference tasks, as they provide information about
the shape or the regularity of the function. In particular, in applications
such as spectrometry (Ferraty and Vieu (2006); Rossi et al. (2005); Rossi and
Villa (2006)), micro-array data (Déjean et al. (2007)) and handwriting
recognition (Williams et al. (2006); Bahlmann and Burkhardt (2004)), these
characteristics lead to accurate predictive models. But, from a theoretical
point of view, limited results about the effect of the use of the derivatives
instead of the original functions are available: Mas and Pumo (2009) study this
problem for a linear model built on the first derivatives of the functions. In
the present paper, we also focus on the theoretical relevance of this common
practice and extend Mas and Pumo (2009) to nonparametric approaches and
incomplete knowledge.

More precisely, we address the problem of the estimation of the conditional
expectation $E(Y|X)$ of a random variable Y given a functional random
variable X. Y is assumed to be either real valued (leading to a regression
problem) or to take values in {-1, 1} (leading to a binary classification
problem). We target two theoretical difficulties. The first difficulty is the
potential information loss induced by using a derivative instead of the
original function: when one replaces X by its order m derivative $X^{(m)}$,
consistent estimators (such as kernel models, Ferraty and Vieu (2006))
guarantee an asymptotic estimation of $E(Y|X^{(m)})$ but cannot be used directly
to address the original problem, namely estimating $E(Y|X)$. This is a simple
consequence of the fact that $X \mapsto X^{(m)}$ is not a one to one mapping.
The second difficulty is induced by sampling: in practice, functions are never
observed exactly but rather, as explained above, sampled on a discrete sampling
grid. As a consequence, one relies on approximate derivatives,
$\widehat{X}^{(m)}_{\tau}$ (where $\tau$ denotes the sampling grid). This approach
induces even more information loss with respect to the underlying functional
variable X: in general, a consistent estimator of $E(Y|\widehat{X}^{(m)}_{\tau})$
will not provide a consistent estimation of $E(Y|X)$, and the optimal
predictive performances for Y given $\widehat{X}^{(m)}_{\tau}$ will be lower than
the optimal predictive performances for Y given X.

We show in this paper that the use of a smoothing spline based approach
solves both problems. Smoothing splines are used to estimate the functions
from their sampled version in a convergent way. In addition, properties of
splines are used to obtain estimates of the derivatives of the functions with no
induced information loss. Both aspects are implemented as a preprocessing
step applied to the multivariate observations generated via the sampling grid.
The preprocessed observations can then be fed into any finite dimensional
consistent regression estimator or classifier, leading to a consistent estimator
for the original infinite dimensional problem (in real world applications,
we instantiate the general scheme in the particular case of kernel machines,
Shawe-Taylor and Cristianini (2004)).

The remainder of the paper is organized as follows: Section 2 introduces
the model, the main smoothness assumption and the notations. Section 3
recalls important properties of spline smoothing. Section 4 presents
approximation results used to build a general consistent classifier or a
general consistent regression estimator in Section 5. Finally, Section 6
illustrates the behavior of the proposed method for two real world
spectrometric problems. The proofs are given at the end of the article.

2. Setup and notations

2.1. Consistent classifiers and regression functions

We consider a pair of random variables (X, Y) where X takes values in
a functional space $\mathcal{X}$ and Y is either a real valued random variable
(regression case) or a random variable taking values in {-1, 1} (binary
classification case). From this, we are given a learning set
$S_n = \{(X_i, Y_i)\}_{i=1}^n$ of n independent copies of (X, Y). Moreover, the
functions $X_i$ are not entirely known but sampled according to a non random
sampling grid of finite length, $\tau_d = (t_l)_{l=1}^{|\tau_d|}$: we only observe
$X_i^{\tau_d} = (X_i(t_1), \ldots, X_i(t_{|\tau_d|}))^T$, a vector of
$\mathbb{R}^{|\tau_d|}$, and denote $S_{n,\tau_d}$ the corresponding learning set.
Our goal is to construct:

1. in the binary classification case: a classifier, $\phi_{n,\tau_d}$, whose
misclassification probability

$$L(\phi_{n,\tau_d}) = P\left(\phi_{n,\tau_d}(X^{\tau_d}) \neq Y\right)$$

asymptotically reaches the Bayes risk

$$L^* = \inf_{\phi: \mathcal{X} \to \{-1,1\}} P\left(\phi(X) \neq Y\right),$$

i.e., $\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} E\left(L(\phi_{n,\tau_d})\right) = L^*$;

2. in the regression case: a regression function, $\phi_{n,\tau_d}$, whose error

$$L(\phi_{n,\tau_d}) = E\left([\phi_{n,\tau_d}(X^{\tau_d}) - Y]^2\right)$$

asymptotically reaches the minimal error

$$L^* = \inf_{\phi: \mathcal{X} \to \mathbb{R}} E\left([\phi(X) - Y]^2\right),$$

i.e., $\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} E\left(L(\phi_{n,\tau_d})\right) = L^*$.

This definition implicitly requires $E(Y^2) < \infty$ and, as a consequence,
corresponds to an $L^2$ convergence of $\phi_{n,\tau_d}$ to the conditional
expectation $\phi^* = E(Y|X)$, i.e., to
$\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} E\left([\phi_{n,\tau_d}(X^{\tau_d}) - \phi^*(X)]^2\right) = 0$.



Such $\phi_{n,\tau_d}$ are said to be (weakly) consistent (Devroye et al. (1996);
Györfi et al. (2002)). We have deliberately used the same notations for the
(optimal) predictive performances in both the binary classification and the
regression case. We will call L* the Bayes risk even in the case of regression.
Most of the theoretical background of this paper is common to both the
regression case and the classification case: the distinction between both cases
will be made only when necessary.

As pointed out in the introduction, the main difficulty is to show that
the performances of a model built on the $X_i^{\tau_d}$ asymptotically reach the
best performance achievable on the original functions $X_i$. In addition, we will
build the model on derivatives estimated from the $X_i^{\tau_d}$.

2.2. Smoothness assumption

Our goal is to leverage the functional nature of the data by allowing
differentiation operators to be applied to functions prior to their submission
to a more common classifier or regression function. Therefore we assume
that the functional space $\mathcal{X}$ contains only differentiable functions.
More precisely, $\mathcal{X}$ is the Sobolev space

$$\mathcal{H}^m = \left\{ h \in L^2([0,1]) \mid \forall j = 1, \ldots, m,\ D^j h \text{ exists in the weak sense, and } D^m h \in L^2([0,1]) \right\},$$

where $D^j h$ is the j-th derivative of h (also denoted by $h^{(j)}$) and
$m \ge 1$ is an integer. Of course, by a straightforward generalization, any
bounded interval can be considered instead of [0, 1].
To estimate the underlying functions $X_i$ and their derivatives from sampled
data, we rely on smoothing splines. More precisely, let us consider
a deterministic function $x \in \mathcal{H}^m$ sampled on the aforementioned grid.
A smoothing spline estimate of x is the solution, $\hat{x}_{\lambda,\tau_d}$, of

$$\arg\min_{h \in \mathcal{H}^m} \frac{1}{|\tau_d|} \sum_{l=1}^{|\tau_d|} \left(x(t_l) - h(t_l)\right)^2 + \lambda \int_{[0,1]} \left(h^{(m)}(t)\right)^2 dt, \qquad (1)$$

where $\lambda$ is a regularization parameter that balances interpolation error
and smoothness (measured by the $L^2$ norm of the m-th derivative of the
estimate). The goal is to show that a classifier or a regression function built
on $\hat{X}^{(m)}_{\lambda,\tau_d}$ is consistent for the original problem (i.e., the
problem defined by the pair (X, Y)): this means that using
$\hat{X}^{(m)}_{\lambda,\tau_d}$ instead of X has no dramatic consequences on the
accuracy of the classifier or of the regression function. In other words,
asymptotically, no information loss occurs when one replaces X by
$\hat{X}^{(m)}_{\lambda,\tau_d}$.



The proof is based on the following steps:

1. First, we show that building a classifier or a regression function on
$\hat{X}^{(m)}_{\lambda,\tau_d}$ is approximately equivalent to building a classifier
or a regression function on $X^{\tau_d} = (X(t_l))_{l=1}^{|\tau_d|}$ using a specific
metric. This is done by leveraging the Reproducing Kernel Hilbert Space (RKHS)
structure of $\mathcal{H}^m$. This part serves one main purpose: it provides a
solution to work with estimation of the derivatives of the original function in
a way that preserves all the information available in $X^{\tau_d}$. In other
words, the best predictive performances for Y theoretically available by
building a multivariate model on $X^{\tau_d}$ are equal to the best predictive
performances obtained by building a functional model on
$\hat{X}^{(m)}_{\lambda,\tau_d}$.

2. Then, we link $E(Y|\hat{X}_{\lambda,\tau_d})$ with $E(Y|X)$ by approximation
results available for smoothing splines. This part of the proof handles the
effects of sampling.

3. Finally, we glue both results via standard $\mathbb{R}^{|\tau_d|}$ consistency
results.
3. Smoothing splines and differentiation operators

3.1. RKHS and smoothing splines

As we want to work on derivatives of functions from $\mathcal{H}^m$, a natural
inner product for two functions of $\mathcal{H}^m$ would be
$(u, v) \mapsto \int_0^1 u^{(m)}(t) v^{(m)}(t)\,dt$.
However, we prefer to use an inner product of $\mathcal{H}^m$
($\int_0^1 u^{(m)}(t) v^{(m)}(t)\,dt$ only induces a semi-norm on $\mathcal{H}^m$)
because, as will be shown later, such an inner product is related to an inner
product between the sampled functions considered as vectors of
$\mathbb{R}^{|\tau_d|}$.

This can be done by decomposing $\mathcal{H}^m$ into
$\mathcal{H}^m = \mathcal{H}^m_0 \oplus \mathcal{H}^m_1$ (Kimeldorf and Wahba (1971)),
where $\mathcal{H}^m_0 = \mathrm{Ker}\, D^m = \mathbb{P}^{m-1}$ (the space of
polynomial functions of degree less or equal to m - 1) and $\mathcal{H}^m_1$ is an
infinite dimensional subspace of $\mathcal{H}^m$ defined via m boundary conditions.
The boundary conditions are given by a full rank linear operator from
$\mathcal{H}^m$ to $\mathbb{R}^m$, denoted B, such that
$\mathrm{Ker}\, B \cap \mathbb{P}^{m-1} = \{0\}$. Classical examples of boundary
conditions include the case of "natural splines" (for m = 2, h(0) = h(1) = 0)
and constraints that target only the values of h and its first derivatives at
a fixed position, for instance the conditions
$h(0) = \ldots = h^{(m-1)}(0) = 0$. Other boundary conditions can be used
(Berlinet and Thomas-Agnan (2004); Besse and Ramsay (1986); Craven and Wahba
(1978)), depending on the application.

Once the boundary conditions are fixed, an inner product on both
$\mathcal{H}^m_0$ and $\mathcal{H}^m_1$ can be defined:

$$\langle u, v \rangle_1 = \langle D^m u, D^m v \rangle_{L^2} = \int_0^1 u^{(m)}(t) v^{(m)}(t)\,dt$$

is an inner product on $\mathcal{H}^m_1$ (as $h \in \mathcal{H}^m_1$ and
$D^m h \equiv 0$ give $h \equiv 0$). Moreover, if we denote
$B = (B^j)_{j=1}^m$, then $\langle u, v \rangle_0 = \sum_{j=1}^m B^j u\, B^j v$ is an
inner product on $\mathcal{H}^m_0$. We obtain this way an inner product on
$\mathcal{H}^m$ given by

$$\langle u, v \rangle_{\mathcal{H}^m} = \int_0^1 u^{(m)}(t) v^{(m)}(t)\,dt + \sum_{j=1}^m B^j u\, B^j v = \langle P^m_1(u), P^m_1(v) \rangle_1 + \langle P^m_0(u), P^m_0(v) \rangle_0,$$

where $P^m_i$ is the projector on $\mathcal{H}^m_i$.

Equipped with $\langle ., . \rangle_{\mathcal{H}^m}$, $\mathcal{H}^m$ is a
Reproducing Kernel Hilbert Space (RKHS, see e.g. Berlinet and Thomas-Agnan
(2004); Heckman and Ramsay (2000); Wahba (1990)). More precisely, there exists
a kernel $k : [0, 1]^2 \to \mathbb{R}$ such that, for all $u \in \mathcal{H}^m$ and
all $t \in [0, 1]$, $\langle u, k(t, .) \rangle_{\mathcal{H}^m} = u(t)$. The same
occurs for $\mathcal{H}^m_0$ and $\mathcal{H}^m_1$, which respectively have
reproducing kernels denoted by $k_0$ and $k_1$. We have $k = k_0 + k_1$.

In the most common cases, $k_0$ and $k_1$ have already been explicitly
calculated (see e.g., Berlinet and Thomas-Agnan (2004), especially chapter 6,
sections 1.1 and 1.6.2). For example, for $m \ge 1$ and the boundary conditions
$h(0) = h'(0) = \ldots = h^{(m-1)}(0) = 0$, we have:

$$k_0(s, t) = \sum_{k=0}^{m-1} \frac{t^k s^k}{(k!)^2}$$

and

$$k_1(s, t) = \int_0^1 \frac{(t - w)_+^{m-1} (s - w)_+^{m-1}}{((m-1)!)^2}\,dw.$$
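As a quick sanity check of the two kernel formulas, the lowest orders can be evaluated by hand. The closed forms below are worked out here from the expressions for $k_0$ and $k_1$ under the boundary conditions $h(0) = \ldots = h^{(m-1)}(0) = 0$; they assume the usual convention $(a)_+ = \max(a, 0)$, with $(a)_+^0$ read as the indicator of $a > 0$, and are not quoted from the paper.

```latex
% m = 1: the integrand is 1 on [0, \min(s,t)] and 0 elsewhere
k_0(s,t) = 1, \qquad
k_1(s,t) = \int_0^{\min(s,t)} dw = \min(s,t).

% m = 2: the integrand is (s-w)(t-w) on [0, \min(s,t)] and 0 elsewhere
k_0(s,t) = 1 + st, \qquad
k_1(s,t) = \int_0^{\min(s,t)} (s-w)(t-w)\,dw
         = \frac{\min(s,t)^2 \,\bigl(3\max(s,t) - \min(s,t)\bigr)}{6}.
```

For m = 1 the kernel $k_1$ is the covariance of a standard Brownian motion, which is one way to remember that the grid must avoid $t = 0$ for $K_1$ to be invertible.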



3.2. Computing the splines

We need now to compute $\hat{x}_{\lambda,\tau_d}$ starting with
$x^{\tau_d} = (x(t))^T_{t \in \tau_d}$. This can be done via a theorem from
Kimeldorf and Wahba (1971). We need the following compatibility assumptions
between the sampling grid $\tau_d$ and the boundary conditions operator B:

Assumption 1. The sampling grid $\tau_d = (t_l)_{l=1}^{|\tau_d|}$ is such that

1. sampling points are distinct in [0, 1] and $|\tau_d| \ge m - 1$

2. the m boundary conditions $B^j$ are linearly independent from the $|\tau_d|$
linear forms $h \mapsto h(t_l)$, for $l = 1, \ldots, |\tau_d|$ (defined on
$\mathcal{H}^m$)

Then $\hat{x}_{\lambda,\tau_d}$ and $x^{\tau_d} = (x(t))^T_{t \in \tau_d}$ are linked by
the following result:

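Theorem 1 (Kimeldorf and Wahba (1971)). Under Assumption (A1), the
unique solution $\hat{x}_{\lambda,\tau_d}$ to equation (1) is given by:

$$\hat{x}_{\lambda,\tau_d} = S_{\lambda,\tau_d}\, x^{\tau_d} \qquad (2)$$

where $S_{\lambda,\tau_d}$ is a full rank linear operator from
$\mathbb{R}^{|\tau_d|}$ to $\mathcal{H}^m$ defined by:

$$S_{\lambda,\tau_d} = \omega^T M_0 + \eta^T M_1 \qquad (3)$$

with

• $M_0 = \left(U (K_1 + \lambda I_d)^{-1} U^T\right)^{-1} U (K_1 + \lambda I_d)^{-1}$

• $M_1 = (K_1 + \lambda I_d)^{-1} (I_d - U^T M_0)$;

• $\{\omega_1, \ldots, \omega_m\}$ is a basis of $\mathbb{P}^{m-1}$,
$\omega = (\omega_1, \ldots, \omega_m)^T$ and
$U = (\omega_i(t))_{i=1,\ldots,m;\ t \in \tau_d}$;

• $\eta = (k_1(t, .))^T_{t \in \tau_d}$ and $K_1 = (k_1(t, t'))_{t, t' \in \tau_d}$.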
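The matrices of Theorem 1 are small enough to be written out directly. The following is a minimal NumPy sketch, assuming the boundary conditions $h(0) = \ldots = h^{(m-1)}(0) = 0$, the monomial basis $\omega_k(t) = t^k / k!$, and a penalized sum of squares without the $1/|\tau_d|$ factor (so that $\lambda$ here absorbs that factor). The function names `k1_matrix` and `spline_operators` are illustrative and do not come from the paper.

```python
import math
import numpy as np


def k1_matrix(t, m):
    """K1[i, j] = int_0^1 (t_i - w)_+^{m-1} (t_j - w)_+^{m-1} dw / ((m-1)!)^2."""
    # The integrand is a polynomial of degree 2m - 2 on [0, min(t_i, t_j)] and
    # zero beyond, so an m-point Gauss-Legendre rule integrates it exactly.
    x, w = np.polynomial.legendre.leggauss(m)
    upper = np.minimum.outer(t, t)
    nodes = 0.5 * upper[..., None] * (x + 1.0)      # map [-1, 1] to [0, upper]
    vals = ((t[:, None, None] - nodes) * (t[None, :, None] - nodes)) ** (m - 1)
    return 0.5 * upper * (vals @ w) / math.factorial(m - 1) ** 2


def spline_operators(t, lam, m):
    """Return U, K1, M0, M1 as defined in Theorem 1 for the grid t."""
    n = len(t)
    # U[k, l] = omega_k(t_l) with the assumed basis omega_k(t) = t^k / k!
    U = np.vstack([t ** k / math.factorial(k) for k in range(m)])
    K1 = k1_matrix(t, m)
    A = np.linalg.inv(K1 + lam * np.eye(n))
    M0 = np.linalg.solve(U @ A @ U.T, U @ A)         # (U A U^T)^{-1} U A
    M1 = A @ (np.eye(n) - U.T @ M0)
    return U, K1, M0, M1
```

With these matrices, the values of the spline on the grid are `U.T @ (M0 @ x) + K1 @ (M1 @ x)`. A polynomial of degree at most m - 1 carries no penalty, so sampling such a polynomial should give `M1 @ x` close to zero and fitted values equal to `x`, which is a convenient check of an implementation.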



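3.3. No information loss

The first important consequence of Theorem 1 is that building a model
on $\hat{X}_{\lambda,\tau_d}$ or on $X^{\tau_d}$ leads to the same optimal predictive
performances (to the same Bayes risk). This is formalized by the following
corollary:

Corollary 1. Under Assumption (A1), we have

• in the binary classification case:

$$\inf_{\phi: \mathcal{H}^m \to \{-1,1\}} P\left(\phi(\hat{X}_{\lambda,\tau_d}) \neq Y\right) = \inf_{\phi: \mathbb{R}^{|\tau_d|} \to \{-1,1\}} P\left(\phi(X^{\tau_d}) \neq Y\right) \qquad (4)$$

• in the regression case:

$$\inf_{\phi: \mathcal{H}^m \to \mathbb{R}} E\left([\phi(\hat{X}_{\lambda,\tau_d}) - Y]^2\right) = \inf_{\phi: \mathbb{R}^{|\tau_d|} \to \mathbb{R}} E\left([\phi(X^{\tau_d}) - Y]^2\right) \qquad (5)$$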

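3.4. Differentiation operator

The second important consequence of Theorem 1 is that the inner product
$\langle ., . \rangle_{\mathcal{H}^m}$ is equivalent to a specific inner product on
$\mathbb{R}^{|\tau_d|}$ given in the following corollary:

Corollary 2. Under Assumption (A1) and for any
$u^{\tau_d} = (u(t))^T_{t \in \tau_d}$ and $v^{\tau_d} = (v(t))^T_{t \in \tau_d}$ in
$\mathbb{R}^{|\tau_d|}$,

$$\langle \hat{u}_{\lambda,\tau_d}, \hat{v}_{\lambda,\tau_d} \rangle_{\mathcal{H}^m} = (u^{\tau_d})^T M_{\lambda,\tau_d}\, v^{\tau_d} \qquad (6)$$

where $M_{\lambda,\tau_d} = M_0^T W M_0 + M_1^T K_1 M_1$ with
$W = (\langle \omega_i, \omega_j \rangle_0)_{i,j=1,\ldots,m}$. The matrix
$M_{\lambda,\tau_d}$ is symmetric and positive definite and defines an inner
product on $\mathbb{R}^{|\tau_d|}$.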
The corollary is a direct consequence of equations (2) and (3).

In practice, the corollary means that the Euclidean space
$(\mathbb{R}^{|\tau_d|}, \langle ., . \rangle_{M_{\lambda,\tau_d}})$ is isomorphic to
$(\mathcal{I}_{\lambda,\tau_d}, \langle ., . \rangle_{\mathcal{H}^m})$, where
$\mathcal{I}_{\lambda,\tau_d}$ is the image of $\mathbb{R}^{|\tau_d|}$ by
$S_{\lambda,\tau_d}$. As a consequence, one can use the Hilbert structure of
$\mathcal{H}^m$ directly in $\mathbb{R}^{|\tau_d|}$ via $M_{\lambda,\tau_d}$: as the
inner product of $\mathcal{H}^m$ is defined on the order m derivatives of the
functions, this corresponds to using those derivatives instead of the original
functions.

More precisely, let $Q_{\lambda,\tau_d}$ be the transpose of the Cholesky triangle
of $M_{\lambda,\tau_d}$ (given by the Cholesky decomposition
$Q_{\lambda,\tau_d}^T Q_{\lambda,\tau_d} = M_{\lambda,\tau_d}$). Corollary 2
shows that $Q_{\lambda,\tau_d}$ acts as an approximate differentiation operation on
sampled functions.
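To make the construction of the metric matrix and of its Cholesky factor concrete, here is a small NumPy sketch for the simplest setting, m = 1 with the boundary condition $h(0) = 0$. In that case $k_1(s, t) = \min(s, t)$, the basis is $\omega_1 = 1$ and $W$ is the 1 x 1 identity. These choices, the grid and the value of $\lambda$ are assumptions made for illustration, and `derivative_metric` is a hypothetical helper name.

```python
import numpy as np


def derivative_metric(M0, M1, K1, W):
    """M = M0^T W M0 + M1^T K1 M1 and Q with Q^T Q = M (Corollary 2)."""
    M = M0.T @ W @ M0 + M1.T @ K1 @ M1
    M = 0.5 * (M + M.T)                  # remove rounding asymmetry
    Q = np.linalg.cholesky(M).T          # transpose of the lower Cholesky triangle
    return M, Q


# Illustrative m = 1 setup; the grid avoids 0 because the boundary condition
# h(0) = 0 must be independent of the evaluation forms h -> h(t_l).
t = np.linspace(0.1, 1.0, 8)
lam = 0.1
K1 = np.minimum.outer(t, t)              # k1(s, t) = min(s, t) when m = 1
U = np.ones((1, t.size))                 # omega_1(t) = 1
A = np.linalg.inv(K1 + lam * np.eye(t.size))
M0 = np.linalg.solve(U @ A @ U.T, U @ A)
M1 = A @ (np.eye(t.size) - U.T @ M0)
M, Q = derivative_metric(M0, M1, K1, W=np.eye(1))
```

After this step, the ordinary dot product of `Q @ u` and `Q @ v` equals `u @ M @ v`, so any method written in terms of Euclidean inner products can be run unchanged on the transformed vectors.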

Let us indeed consider an estimation method for multivariate inputs based
only on inner products or norms (that are directly derived from the inner
products), such as, e.g., Kernel Ridge Regression (Saunders et al. (1998);
Shawe-Taylor and Cristianini (2004)). In this latter case, if a Gaussian kernel
is used, the regression function has the following form:

$$u \mapsto \sum_{i=1}^n T_i \alpha_i e^{-\gamma \|U_i - u\|^2_{\mathbb{R}^p}} \qquad (7)$$

where $(U_i, T_i)_{1 \le i \le n}$ are learning examples in
$\mathbb{R}^p \times \{-1, 1\}$, the $\alpha_i$ are non negative real values
obtained by solving a quadratic programming problem and $\gamma$ is a parameter
of the method. Then, if we use Kernel Ridge Regression on the training set
$\{(Q_{\lambda,\tau_d} X_i^{\tau_d}, Y_i)\}_{i=1}^n$ (rather than the original
training set $\{(X_i^{\tau_d}, Y_i)\}_{i=1}^n$), it will work on the norm in $L^2$
of the derivatives of order m of the spline estimates of the $X_i$ (up to the
boundary conditions). More precisely, the regression function will have the
following form:

$$x^{\tau_d} \mapsto \sum_{i=1}^n Y_i \alpha_i e^{-\gamma \|Q_{\lambda,\tau_d} X_i^{\tau_d} - Q_{\lambda,\tau_d} x^{\tau_d}\|^2_{\mathbb{R}^{|\tau_d|}}} = \sum_{i=1}^n Y_i \alpha_i e^{-\gamma \|\hat{X}_{i,\lambda,\tau_d} - \hat{x}_{\lambda,\tau_d}\|^2_{\mathcal{H}^m}}$$

In other words, up to the boundary conditions, an estimation method based
solely on inner products, or on norms derived from these inner products,
can be given modified inputs that will make it work on an estimation of the
derivatives of the observed functions.
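The effect of the transformation on a Gaussian kernel can be checked numerically. The sketch below computes the Gram matrix on the transformed inputs and relies on the identity $\|Qu - Qv\|^2 = (u - v)^T M (u - v)$ whenever $Q^T Q = M$. The function name `gaussian_gram` and the convention that sampled functions are stored as rows of `X` are assumptions of this example.

```python
import numpy as np


def gaussian_gram(X, gamma, Q=None):
    """Gaussian Gram matrix exp(-gamma * ||Q x_i - Q x_j||^2) for the rows of X.

    With Q = None the plain Euclidean distance between sampled functions is used.
    """
    Z = X if Q is None else X @ Q.T      # row i of Z is Q @ X[i]
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    d2 = np.maximum(d2, 0.0)             # guard against tiny negative rounding
    return np.exp(-gamma * d2)
```

Passing the resulting matrix to any kernel method that accepts a precomputed Gram matrix is then equivalent to running that method with the metric induced by $M_{\lambda,\tau_d}$.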

Remark 1. As shown in Corollary 1 in the previous section, building a
model on $X^{\tau_d}$ or on $\hat{X}_{\lambda,\tau_d}$ leads to the same optimal
predictive performances. In addition, it is obvious that given any one-to-one
mapping f from $\mathbb{R}^{|\tau_d|}$ to itself, building a model on
$f(X^{\tau_d})$ also gives the same optimal performances as building a model on
$X^{\tau_d}$. Then, as $Q_{\lambda,\tau_d}$ is invertible, the optimal predictive
performances achievable with $Q_{\lambda,\tau_d} X^{\tau_d}$ are equal to the
optimal performances achievable with $X^{\tau_d}$ or with
$\hat{X}_{\lambda,\tau_d}$.

In practice however, the actual preprocessing of the data can have a strong
influence on the obtained performances, as will be illustrated in Section 6.
The goal of the theoretical analysis of the present section is to guarantee
that no systematic loss can be observed as a consequence of the proposed
functional preprocessing scheme.



4. Approximation results

The previous section showed that working on $X^{\tau_d}$,
$Q_{\lambda,\tau_d} X^{\tau_d}$ or $\hat{X}_{\lambda,\tau_d}$ makes no difference in
terms of optimal predictive performances. The present section addresses the
effects of sampling: asymptotically, the optimal predictive performances
obtained on $\hat{X}_{\lambda,\tau_d}$ converge to the optimal performances
achievable on the original and unobserved functional variable X.

4.1. Spline approximation

From the sampled random function
$X^{\tau_d} = (X(t_1), \ldots, X(t_{|\tau_d|}))$, we can build an estimate,
$\hat{X}_{\lambda,\tau_d}$, of X. To ensure consistency, we must guarantee
that $\hat{X}_{\lambda,\tau_d}$ converges to X. In the case of a deterministic
function x, this problem has been studied in numerous papers, such as Craven
and Wahba (1978); Ragozin (1983); Cox (1984); Utreras (1988); Wahba (1990)
(among others). Here we recall one of the results which is particularly well
adapted to our context.

Obviously, the sampling grid must behave correctly, otherwise the information
contained in $X^{\tau_d}$ will not be sufficient to recover X. We also need
the regularization parameter $\lambda$ to depend on $\tau_d$. Following Ragozin
(1983), a sampling grid $\tau_d$ is characterized by two quantities:

$$\overline{\Delta}_{\tau_d} = \max\{t_1,\ t_2 - t_1,\ \ldots,\ 1 - t_{|\tau_d|}\}, \qquad \underline{\Delta}_{\tau_d} = \min_{1 \le i < |\tau_d|} \{t_{i+1} - t_i\}. \qquad (8)$$
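The two grid quantities of equation (8) are straightforward to compute, which makes it easy to check how far a given grid is from uniform before applying the results of this section. The helper below is a minimal sketch; the name `grid_mesh` is illustrative, and the grid is assumed to contain at least two distinct points in [0, 1].

```python
import numpy as np


def grid_mesh(t):
    """Return (largest gap, smallest gap) of a sampling grid in [0, 1], as in (8).

    The largest gap also accounts for the distance from 0 to the first point
    and from the last point to 1.
    """
    t = np.sort(np.asarray(t, dtype=float))
    gaps = np.diff(t)
    delta_max = max(t[0], gaps.max(), 1.0 - t[-1])
    delta_min = gaps.min()
    return delta_max, delta_min
```

For the grid (0.1, 0.3, 0.4, 0.9), for instance, the largest gap is 0.5 and the smallest is 0.1, so the ratio of the two is about 5.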
One way to control the distance between X and $\hat{X}_{\lambda,\tau_d}$ is to
bound the ratio $\overline{\Delta}_{\tau_d} / \underline{\Delta}_{\tau_d}$ so as to
ensure quasi-uniformity of the sampling grid.
More precisely, we will use the following assumption:

Assumption 2. There is R such that
$\overline{\Delta}_{\tau_d} / \underline{\Delta}_{\tau_d} \le R$ for all d.

Then we have:

Theorem 2 (Ragozin (1983)). Under Assumptions (A1) and (A2), there are
two constants $A_{R,m}$ and $B_{R,m}$ depending only on R and m, such that for any
$x \in \mathcal{H}^m$ and any positive $\lambda$:



This result is a rephrasing of Corollary 4.16 from Ragozin (1983) which
is itself a direct consequence of Theorem 4.10 from the same paper.

Convergence of $\hat{x}_{\lambda,\tau_d}$ to x is then obtained by the following
simple assumptions:

Assumption 3. The series of sampling points $\tau_d$ and the series of
regularization parameters, $\lambda$, depending on $\tau_d$ and denoted by
$(\lambda_d)_{d \ge 1}$, are such that
$\lim_{d \to +\infty} |\tau_d| = +\infty$ and $\lim_{d \to +\infty} \lambda_d = 0$.

4.2. Conditional expectation approximation

The next step consists in relating the optimal predictive performances
for the regression and the classification problem (X, Y) to the performances
associated to $(\hat{X}_{\lambda_d,\tau_d}, Y)$ when d goes to infinity, i.e.,
relating L* to

1. binary classification case:

$$L_d^* = \inf_{\phi: \mathcal{H}^m \to \{-1,1\}} P\left(\phi(\hat{X}_{\lambda_d,\tau_d}) \neq Y\right),$$

2. regression case:

$$L_d^* = \inf_{\phi: \mathcal{H}^m \to \mathbb{R}} E\left([\phi(\hat{X}_{\lambda_d,\tau_d}) - Y]^2\right).$$

Two sets of assumptions will be investigated to provide the convergence
of the Bayes risk $L_d^*$ to L*:

Assumption 4. Either

(A4a) $E\left(\|D^m X\|^2_{L^2}\right)$ is finite and $Y \in \{-1, 1\}$,

or

(A4b) $\tau_d \subset \tau_{d+1}$ and $E(Y^2)$ is finite.

The first assumption (A4a) requires an additional smoothing property for
the predictor functional variable X and is only valid for a binary
classification problem, whereas the second assumption (A4b) requires an
additional property for the sampling point series: they have to be growing sets.

Theorem 2 then leads to the following corollary:

Corollary 3. Under Assumptions (A1)-(A4), we have:

$$\lim_{d \to +\infty} L_d^* = L^*.$$

5. General consistent functional classifiers and regression functions

5.1. Definition of classifiers and regression functions on derivatives

Let us now consider any consistent classification or regression scheme
for standard multivariate data based either on the inner product or on
the Euclidean distance between observations. Examples of such classifiers
are Support Vector Machines (Steinwart (2002)), the kernel classification rule
(Devroye and Krzyżak (1989)) and k-nearest neighbors (Devroye and Györfi
(1985); Zhao (1987)), to name a few. In the same way, multilayer perceptrons
(Lugosi and Zeger (1990)), kernel estimates (Devroye and Krzyżak (1989)) and
k-nearest neighbors regression (Devroye et al. (1994)) are consistent regression
estimators. Additional examples of consistent estimators in classification and
regression can be found in Devroye et al. (1996); Györfi et al. (2002).

We denote $\psi_{\mathcal{D}}$ the estimator constructed by the chosen scheme
using a dataset $\mathcal{D} = \{(U_i, T_i)_{1 \le i \le n}\}$, where the
$(U_i, T_i)_{1 \le i \le n}$ are n independent copies of a pair of random
variables (U, T) with values in $\mathbb{R}^p \times \{-1, 1\}$ (classification)
or $\mathbb{R}^p \times \mathbb{R}$ (regression).

The proposed functional scheme consists in choosing the estimator
$\phi_{n,\tau_d}$ as $\psi_{\mathcal{E}_{n,\tau_d}} \circ Q_{\lambda_d,\tau_d}$ with
the dataset $\mathcal{E}_{n,\tau_d}$ defined by:

$$\mathcal{E}_{n,\tau_d} = \{(Q_{\lambda_d,\tau_d} X_i^{\tau_d}, Y_i)_{1 \le i \le n}\}$$
As pointed out in Section 3.4, the linear transformation $Q_{\lambda_d,\tau_d}$ is
an approximate multivariate differentiation operator: up to the boundary
conditions, an estimator based on $Q_{\lambda_d,\tau_d} X^{\tau_d}$ is working on the
m-th derivative of $\hat{X}_{\lambda_d,\tau_d}$.

In more algorithmic terms, the estimator is obtained as follows:

1. choose an appropriate value for $\lambda_d$

2. compute $M_{\lambda_d,\tau_d}$ using Theorem 1 and Corollary 2

3. compute the Cholesky decomposition of $M_{\lambda_d,\tau_d}$ and the transpose
of the Cholesky triangle, $Q_{\lambda_d,\tau_d}$ (such that
$Q_{\lambda_d,\tau_d}^T Q_{\lambda_d,\tau_d} = M_{\lambda_d,\tau_d}$)

4. compute the $Q_{\lambda_d,\tau_d} X_i^{\tau_d}$ to obtain the transformed dataset
$\mathcal{E}_{n,\tau_d}$

5. build a classifier/regression function $\psi_{\mathcal{E}_{n,\tau_d}}$ with a
multivariate method in $\mathbb{R}^{|\tau_d|}$ applied to the dataset
$\mathcal{E}_{n,\tau_d}$

6. associate to a new sampled function $X_{n+1}^{\tau_d}$ the prediction
$\psi_{\mathcal{E}_{n,\tau_d}}(Q_{\lambda_d,\tau_d} X_{n+1}^{\tau_d})$.
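Steps 3 to 6 above can be sketched in a few lines once the metric matrix of step 2 is available. The example below takes that matrix as a given input `M` and plugs in a plain k-nearest-neighbour vote as the multivariate method; this is one possible choice among the consistent schemes cited in this section, not the kernel machines used in the paper's experiments. All function names are hypothetical, and labels are assumed to be in {-1, 1} with k odd.

```python
import numpy as np


def fit_preprocessor(M):
    """Step 3: Q such that Q.T @ Q = M (transpose of the Cholesky triangle)."""
    return np.linalg.cholesky(M).T


def transform(Q, X):
    """Step 4: apply Q to each sampled function, stored as a row of X."""
    return X @ Q.T


def knn_predict(Z_train, y_train, Z_new, k=3):
    """Steps 5-6: majority vote among the k nearest transformed training curves."""
    d2 = ((Z_new[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.where(y_train[nearest].sum(axis=1) > 0, 1, -1)
```

A new curve is classified with `knn_predict(transform(Q, X_train), y_train, transform(Q, X_new))`. The same `Q` must be applied to training and new curves so that distances are measured in the same metric.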

Figure 1 illustrates the way the method performs: instead of relying
on an approximation of the function and then on the derivation preprocessing
of this estimate, it directly uses an equivalent metric by applying the
$Q_{\lambda_d,\tau_d}$ matrix to the sampled function. The consistency result
proved in Theorem 3 shows that, combined with any consistent multidimensional
learning algorithm, this method is (asymptotically) equivalent to using the
original function drawn at the top left side of Figure 1.

From a practical point of view, Wahba (1990) demonstrates that cross
validated estimates of $\lambda$ achieve suitable convergence rates. Hence,
steps 1 and 2 can be computed simultaneously by minimizing the total cross
validated error for all the observations, given by

$$\sum_{i=1}^n \frac{1}{|\tau_d|} \sum_{t \in \tau_d} \frac{\left(x_i(t) - \hat{x}_{i,\lambda,\tau_d}(t)\right)^2}{\left(1 - A_{tt}(\lambda)\right)^2}$$

where A is a $|\tau_d| \times |\tau_d|$ matrix called the influence matrix (see
Wahba (1990)), over a finite number of $\lambda$ values.
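The cross validated choice of the regularization parameter can be sketched as follows. This example assumes m = 1 with the boundary condition $h(0) = 0$ (so that $k_1(s, t) = \min(s, t)$ and $\omega_1 = 1$), takes the influence matrix to be the matrix mapping sampled values to fitted values on the grid, and uses the standard leave-one-out shortcut for linear smoothers, in which each residual is divided by one minus the matching diagonal entry. Function names are illustrative, and $\lambda$ is the parameter of the unnormalized penalized sum of squares.

```python
import numpy as np


def smoother_parts(t, lam):
    """M0, M1, K1, U of the spline smoother for m = 1 and h(0) = 0 (assumed setup)."""
    n = len(t)
    K1 = np.minimum.outer(t, t)              # k1(s, t) = min(s, t)
    U = np.ones((1, n))                      # omega_1(t) = 1
    A = np.linalg.inv(K1 + lam * np.eye(n))
    M0 = np.linalg.solve(U @ A @ U.T, U @ A)
    M1 = A @ (np.eye(n) - U.T @ M0)
    return M0, M1, K1, U


def influence_matrix(t, lam):
    """Matrix sending the sampled values to the spline values on the grid."""
    M0, M1, K1, U = smoother_parts(t, lam)
    return U.T @ M0 + K1 @ M1


def loo_error(X, t, lam):
    """Total leave-one-out reconstruction error over the curves (rows of X)."""
    A = influence_matrix(t, lam)
    resid = X - X @ A.T
    return np.sum((resid / (1.0 - np.diag(A))) ** 2) / len(t)


def select_lambda(X, t, candidates):
    """Pick the candidate value with the smallest total leave-one-out error."""
    errors = [loo_error(X, t, lam) for lam in candidates]
    return candidates[int(np.argmin(errors))]
```

The shortcut avoids refitting the spline once per left-out point: a single influence matrix per candidate value is enough, which matters when the grid is fine and the set of candidates is large.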
Figure 1: Method scheme and its equivalence to the usual approach for using derivatives
in learning algorithms.
5.2. Consistency result

Corollary 1 and Corollary 3 guarantee that the estimator proposed in the
previous section is consistent:

Theorem 3. Under assumptions (A1)-(A4), the series of classifiers/regression
functions $(\phi_{n,\tau_d})_{n,d}$ is consistent:

$$\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} E\left(L(\phi_{n,\tau_d})\right) = L^*$$

5.3. Discussion

While Theorem 3 is very general, it could be easily extended to cover
special cases such as additional hypotheses needed by the estimation scheme
or to provide data based parameter selections. We discuss briefly those issues
in the present section.

It should first be noted that most estimation schemes, $\psi_{\mathcal{D}}$,
depend on parameters that should fulfill some assumptions for the scheme to be
consistent. For instance, in the Kernel Ridge Regression method in
$\mathbb{R}^p$, with Gaussian kernel, $\psi_{\mathcal{D}}$ has the form given in
Equation (7) where the $(\alpha_i)$ are the solutions of

$$\arg\min_{\alpha \in \mathbb{R}^n} \sum_{i=1}^n \Big(T_i - \sum_{j=1}^n T_j \alpha_j e^{-\gamma \|U_i - U_j\|^2_{\mathbb{R}^p}}\Big)^2 + \delta_n \sum_{i,j=1}^n T_i T_j \alpha_i \alpha_j e^{-\gamma \|U_i - U_j\|^2_{\mathbb{R}^p}}.$$

The method thus depends on the parameter of the Gaussian kernel, $\gamma$, and
on the regularization parameter $\delta_n$. This method is known to be consistent
if (see Theorem 9.1 of Steinwart and Christmann (2008)):

$$\delta_n \to 0 \quad \text{and} \quad n \delta_n^4 \to +\infty.$$

Additional conditions of this form can obviously be directly integrated in
Theorem 3 to obtain consistency results specific to the corresponding
algorithms.

Moreover, practitioners generally rely on data based selection of the
parameters of the estimation scheme $\psi_{\mathcal{D}}$ via a validation method:
for instance, rather than setting $\delta_n$ to a fixed sequence compatible
with the theoretical constraints on $\delta_n$, one chooses the value of
$\delta_n$ that optimizes an estimation of the performances of the regression
function obtained on an independent data set (or via a re-sampling approach).

In addition to the parameters of the estimation scheme, functional data
raise the question of the convenient order of the derivative, m, and of the
sampling grid optimality. In practical applications, the number of available
sampling points can be unnecessarily large (see Biau et al. (2005) for an
example with more than 8 000 sampling points). The preprocessing performed
by $Q_{\lambda_d,\tau_d}$ does not change the dimensionality of the data, which
means that overfitting can be observed in practice when the number of sampling
points is large compared to the number of functions. Moreover, processing very
high dimensional vectors is time consuming. It is therefore quite interesting in
practice to use a down-sampled version of the original grid.

To select the parameters of $\psi_{\mathcal{D}}$, the order of the derivative
and/or the down-sampled grid, a validation strategy, based on splitting the
dataset into training and validation sets, could be used. A simple adaptation
of the idea of Berlinet et al. (2008); Biau et al. (2005); Laloë (2008); Rossi
and Villa (2006) shows that a penalized validation method can be used to choose
any combination of those parameters consistently. According to those papers,
the condition for the consistency of the validation strategy would simply
relate the shatter coefficients of the set of classifiers to the penalization
parameter of the validation. Once again, this type of result is a rather direct
extension of Theorem 3.



6. Applications

In this section, we show that the proposed approach works as expected on
real world spectrometric examples: for some applications, the use of
derivatives leads to more accurate models than the direct processing of the
spectra (see e.g. Rossi et al. (2005); Rossi and Villa (2006) for other examples
of such a behavior based on ad hoc estimators of the spectra derivatives). It
should be noted that the purpose of this section is only to illustrate the
behavior of the proposed method on finite datasets. The theoretical results of
the present paper show that all consistent schemes have asymptotically identical
performances, and therefore that using derivatives is asymptotically useless.
On a finite dataset however, preprocessing can have a strong influence on the
predictive performances, as will be illustrated in the present section. In
addition, schemes that are not universally consistent, e.g., linear models, can
lead to excellent predictive performances on finite datasets; such models are
therefore included in the present section despite the fact that the theory does
not apply to them.

6.1. Methodology

The methodology followed for the two illustrative datasets is roughly the
same:

1. The dataset is randomly split into a training set on which the model is
estimated and a test set on which performances are computed. The split
is repeated several times. The Tecator dataset (Section 6.2) is rather
small (240 spectra) and exhibits a rather large variability in predictive
performances between different random splits. We have therefore
used 250 random splits. For the Yellow-berry dataset (Section 6.3), we
used only 50 splits as the relative variability in performances is far less
important.

2. λ is chosen by a global leave-one-out strategy on the spectra contained
in the training set (as suggested in Section 5.1). More precisely, a leave-one-out
estimate of the reconstruction error of the spline approximation of
each training spectrum is computed for a finite set of candidate values
for λ. Then a common λ is chosen by minimizing the average over
the training spectra of the leave-one-out reconstruction errors. This
choice is relevant as cross validation estimates of λ are known to have
favorable theoretical properties (see Craven and Wahba (1978); Utreras
(1981) among others).



3. For regression problems, a Kernel Ridge Regression (KRR)
(Saunders et al. (1998); Shawe-Taylor and Cristianini (2004)) is then
performed to estimate the regression function; this method is consistent
when used with a Gaussian kernel under additional conditions on the
parameters (see Theorem 9.1 of Steinwart and Christmann (2008)). As
already explained, in the applications, Kernel Ridge Regression is
performed both with a Gaussian kernel and with a linear kernel (in that
last case, the model is essentially a ridge regression model). Parameters
of the models (a regularization parameter, $\delta_n$, in all cases and a kernel
parameter, $\gamma$, for Gaussian kernels) are chosen by a grid search that
minimizes a validation based estimate of the performances of the model
(on the training set). A leave-one-out solution has been chosen: in Kernel
Ridge Regression, the leave-one-out estimate of the performances of
the model is obtained as a by-product of the estimation process, without
additional computation cost, see e.g. Cawley and Talbot (2004).
Additionally, for the sake of comparison with a more traditional approach
in FDA, Kernel Ridge Regression is compared with a nonparametric
kernel estimate for the Tecator dataset (Section 6.2.1). The nonparametric
kernel estimate is the first nonparametric approach introduced in Functional
Data Analysis (Ferraty and Vieu (2006)) and can thus be seen as
a basis for comparison in the context of regression with functional predictors.
For this method, the same methodology as with Kernel Ridge
Regression was used: the parameter of the model (i.e., the bandwidth)
was selected by a grid search minimizing a cross-validation estimate of
the performances of the model. In this case, a 4-fold cross validation
estimate was used instead of a leave-one-out estimate to avoid a large
computational cost.

4. For the classification problem, a Support Vector Machine (SVM) is used
(Shawe-Taylor and Cristianini (2004)). Like KRR, SVMs are consistent
when used with a Gaussian kernel (Steinwart (2002)). We also use an
SVM with a linear kernel as this is quite adapted for classification in
high dimensional spaces associated to sampled function data. We also
use a K-nearest neighbor model (KNN) for reference. Parameters of the
models (a regularization parameter for both SVMs, a kernel parameter,
$\gamma$, for Gaussian kernels and the number of neighbors K for KNN) are chosen
by a grid search that minimizes a validation based estimate of the
classification error: we use a 4-fold cross-validation to get this estimate.

5. We evaluate the models obtained for each random split on the test set. 
We report the mean and the standard deviation of the performance 
index (classification error and mean squared error, respectively) and 
assess the significance of differences between the reported figures via 
paired Student tests (with level 1%). 

6. Finally, we compare models estimated on the raw spectra and on spectra
transformed via the $Q_{\lambda,\tau_d}$ matrix for m = 1 (first derivative) and
m = 2 (second derivative). For both values of m, we used the most
classical boundary conditions (x(0) = 0 and Dx(0) = 0). Depending on
the problem, other boundary conditions could be investigated but this
is outside the scope of the present paper (see Besse and Ramsay (1986);
Heckman and Ramsay (2000) for discussion on this subject). For the
Tecator problem, we also compare these approaches with models
estimated on first and second derivatives based on interpolating splines
(i.e. with λ = 0) and on first and second derivatives estimated by finite
differences.

Note that the kind of preprocessing used has almost no impact on
the computation time. In general, selecting the parameters of the
model with leave-one-out or cross-validation will use significantly more
computing power than constructing the splines and calculating their
derivatives. For instance, computing the optimal λ with the approach
described above takes less than 0.1 second for the Tecator dataset on a
standard PC using our R implementation, which is negligible compared
to the several minutes used to select the optimal parameters of the
models used on the preprocessed data.
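The leave-one-out by-product mentioned in step 3 can be sketched as follows. This is a minimal pure-Python illustration, not the authors' R implementation: it assumes a kernel ridge regression without intercept, coefficients $\alpha = (K + \delta I)^{-1} y$, and the Gaussian kernel $\exp(-\gamma \|u - v\|^2)$. Under these assumptions the i-th leave-one-out residual equals $\alpha_i / [(K + \delta I)^{-1}]_{ii}$, so no model has to be refitted. All function names are hypothetical.

```python
import math


def gaussian_kernel(u, v, gamma):
    """Gaussian kernel exp(-gamma * ||u - v||^2) between two sampled curves."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def krr_loo_residuals(X, y, gamma, delta):
    """Leave-one-out residuals of kernel ridge regression, without refitting."""
    n = len(X)
    G = [[gaussian_kernel(X[i], X[j], gamma) + (delta if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(G, y)
    # Diagonal of G^{-1}, obtained by solving against each basis vector.
    diag = [solve(G, [1.0 if k == i else 0.0 for k in range(n)])[i]
            for i in range(n)]
    return [a / d for a, d in zip(alpha, diag)]


def select_by_loo(X, y, gammas, deltas):
    """Grid search minimizing the leave-one-out mean squared error."""
    def loo_mse(params):
        residuals = krr_loo_residuals(X, y, *params)
        return sum(r * r for r in residuals) / len(residuals)
    return min(((g, d) for g in gammas for d in deltas), key=loo_mse)
```

The repeated linear solves keep the sketch short but are wasteful; a practical implementation would factorize the regularized kernel matrix once per candidate pair of parameters.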



6.2. Tecator dataset



The first studied dataset is the standard Tecator dataset (Thodberg (1996), see footnote 1).
It consists of spectrometric data from the food industry. Each of the
240 observations is the near infrared absorbance spectrum of a meat sample
recorded on a Tecator Infratec Food and Feed Analyzer. Each spectrum is
sampled at 100 wavelengths uniformly spaced in the range 850-1050 nm.
The composition of each meat sample is determined by analytic chemistry
and percentages of moisture, fat and protein are associated this way to each
spectrum.

The Tecator dataset is a widely used benchmark in Functional Data Analysis,
hence the motivation for its use for illustrative purposes. More precisely,
in Section 6.2.1, we address the original regression problem by predicting the
percentage of fat content from the spectra with various regression methods
and various estimates of the derivative preprocessing: this analysis shows
that both the method and the use of derivatives have a strong effect on the
performances whereas the way the derivatives are estimated has almost no
effect. Additionally, in Section 6.2.2, we apply a noise (with various variances)
to the original spectra in order to study the influence of smoothing
in the case of noisy predictors: this section shows the relevance of the use of
a smoothing spline approach when the data are noisy. Finally, Section 6.2.3
deals with a classification problem derived from the original Tecator problem
(in the same way as what was done in Ferraty and Vieu (2003)): conclusions
of this section are similar to the ones of the regression study.

Footnote 1: Data are available on statlib at http://lib.stat.cmu.edu/datasets/tecator



6.2.1. Fat content prediction

As explained above, we first address the regression problem that consists
in predicting the fat content of pieces of meat from the Tecator dataset. The
parameters of the model are optimized with a grid search using the leave-one-out
estimate of the predictive performances (both models use a regularization
parameter, with an additional width parameter in the Gaussian kernel case).
The original data set is split randomly into 160 spectra for learning and 80
spectra for testing. As shown in the result Table 1, the data exhibit a rather
large variability; we therefore use 250 random splits to assess the differences
between the different approaches.

The performance indexes are the mean squared error (M.S.E.) and the
$R^2$ (see footnote 2). As a reference, the target variable (fat) has a variance equal to 14.36.
Results are summarized in Table 1.
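The two performance indexes can be computed as in the minimal sketch below. It assumes that the empirical variance of the target on the test set is normalized by the number of test observations (the paper does not state which normalization is used); the function names and the toy values are illustrative only.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - M.S.E. / Var(y), with Var(y) the empirical test variance.

    Assumption: the variance is normalized by n (not n - 1).
    """
    n = len(y_true)
    mean = sum(y_true) / n
    variance = sum((t - mean) ** 2 for t in y_true) / n
    return 1.0 - mean_squared_error(y_true, y_pred) / variance


# Toy usage with made-up values (not taken from the Tecator data).
targets = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(targets, targets)
constant = r_squared(targets, [2.5, 2.5, 2.5, 2.5])
poor = r_squared(targets, [4.0, 3.0, 2.0, 1.0])
```

With this definition, a model that always predicts the test mean gets an $R^2$ of zero, and a model that does worse than that gets a negative value.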

The first conclusion is that the method itself has a strong effect on the
performances of the prediction: for this application, a linear method is not
appropriate (mean squared errors are much greater for linear methods than
for the kernel ridge regression used with a Gaussian kernel) and the nonparametric
kernel estimate gives worse performances than the kernel ridge
regression (indeed, they are about 10 times worse). Nevertheless, for nonparametric
approaches (Gaussian KRR and NKE), the use of derivatives
also has a strong impact on the performances: for kernel ridge regression,
e.g., preprocessing by estimating the first order derivative leads to a strong
decrease of the mean squared error.

Differences between the average MSEs are not always significant, but
we can nevertheless rank the methods in increasing order of modeling error
(using notations explained in Table 1) for Gaussian kernel ridge regression:

FD1 < IS1 < S1 < FD2 < S2 < IS2 < O

where < corresponds to a significant difference (for a paired Student test
with level 1%) and ≤ to a non significant one. In this case, the data are very
smooth and thus the use of smoothing splines instead of a finite differences



Footnote 2: $R^2 = 1 - \frac{\text{M.S.E.}}{\text{Var}(y)}$ where Var(y) is the (empirical) variance of the target variable on the
test set.
Method        Data  Average M.S.E. and SD  Average R^2

KRR Linear    O     8.69 (4.47)            95.7%
              S1    8.09 (3.85)            96.1%
              IS1   8.09 (3.85)            96.1%
              FD1   8.27 (4.17)            96.0%
              S2    9.64 (4.98)            95.3%
              IS2   9.87 (5.84)            95.2%
              FD2   8.45 (4.18)            95.9%

KRR Gaussian  O     5.02 (11.47)           97.6%
              S1    0.485 (0.385)          99.8%
              IS1   0.485 (0.385)          99.8%
              FD1   0.484 (0.387)          99.8%
              S2    0.584 (0.303)          99.7%
              IS2   0.586 (0.303)          99.7%
              FD2   0.569 (0.281)          99.7%

NKE           O     73.1 (16.5)            64.2%
              S1    4.59 (1.09)            97.7%
              IS1   4.59 (1.09)            97.7%
              FD1   4.59 (1.09)            97.7%
              S2    3.75 (1.22)            98.2%
              IS2   3.75 (1.22)            98.2%
              FD2   3.67 (1.18)            98.2%

Table 1: Summary of the performances of the chosen models on the test set (fat Tecator
regression problem) when using either a kernel ridge regression (KRR) with linear kernel
or with Gaussian kernel or when using a nonparametric kernel estimate (NKE) with
various inputs: O (original data), S1 (smoothing splines with order 1 derivatives), IS1
(interpolating splines with order 1 derivatives), FD1 (order 1 derivatives estimated by finite
differences) and S2, IS2 and FD2 (the same as previously with order 2 derivatives).
approximation does not have a significant impact on the predictions. However,
in this case, the roughest approach, consisting in the estimation of the
derivatives by finite differences, gives the best performances.
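For reference, derivative estimation by finite differences on a uniform grid can be sketched as below. The paper does not specify which finite difference scheme was used, so the central differences chosen here (one-sided at both ends for the first derivative, interior points only for the second) are an assumption, and the function names are hypothetical.

```python
def first_derivative(values, step):
    """First derivative on a uniform grid.

    Central differences are used in the interior and one-sided
    differences at the two end points.
    """
    n = len(values)
    derivative = [0.0] * n
    derivative[0] = (values[1] - values[0]) / step
    derivative[-1] = (values[-1] - values[-2]) / step
    for i in range(1, n - 1):
        derivative[i] = (values[i + 1] - values[i - 1]) / (2.0 * step)
    return derivative


def second_derivative(values, step):
    """Second derivative by central differences, at interior points only."""
    return [(values[i + 1] - 2.0 * values[i] + values[i - 1]) / step ** 2
            for i in range(1, len(values) - 1)]


# Toy usage on a smooth curve x(t) = t^2 sampled with step 0.01.
step = 0.01
grid = [i * step for i in range(101)]
curve = [t ** 2 for t in grid]
d1 = first_derivative(curve, step)
d2 = second_derivative(curve, step)
```

Each estimate divides a difference of observed values by the grid step or its square, so observation noise is amplified, which is consistent with the behavior reported for noisy spectra in Section 6.2.2.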

6.2.2. Noisy spectra

This section studies the situation in which functional data observations
are corrupted by noise. This is done by adding noise to each spectrum of
the Tecator dataset. More precisely, each spectrum has been corrupted by

$$X_i^b(t) = X_i(t) + \epsilon_{it} \qquad (9)$$

where $(\epsilon_{it})$ are i.i.d. Gaussian variables with standard deviation equal to
either 0.01 (small noise) or 0.2 (large noise). 10 observations of the data
generated this way are given in Figure 2.
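The corruption of Equation (9) can be simulated with the short sketch below. It assumes centered Gaussian noise drawn independently for every spectrum and every sampling point; the zero mean, the function name `add_noise` and the `seed` argument are assumptions added for reproducibility of the illustration.

```python
import random


def add_noise(spectra, sd, seed=None):
    """Add i.i.d. centered Gaussian noise of standard deviation `sd`.

    `spectra` is a list of spectra, each given as a list of sampled values.
    A new list is returned; the input is left unchanged.
    """
    rng = random.Random(seed)
    return [[value + rng.gauss(0.0, sd) for value in spectrum]
            for spectrum in spectra]


# Toy usage: 50 flat "spectra" with 100 sampling points each.
clean = [[0.0] * 100 for _ in range(50)]
small_noise = add_noise(clean, 0.01, seed=1)
large_noise = add_noise(clean, 0.2, seed=1)
```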

The same methodology as for the non noisy data has been applied to $(X_i^b)$
to predict the fat content. The experiments have been restricted to the use of
kernel ridge regression with a Gaussian kernel (according to the nonlinearity
of the problem shown in the previous section). Results are summarized in
Table 2 and Figure 3.

In addition, the results can be ranked this way:

Noise with sd equal to 0.01

S2 < S1 < IS1 < O < FD1 < IS2 < FD2

Noise with sd equal to 0.2

S1 < O < S2 < FD1 < IS1 < IS2 < FD2

where < corresponds to a significant difference (for a paired Student test
with level 1%).
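The significance assessment behind such rankings can be sketched as follows: the two methods are evaluated on the same random splits, and the test is applied to the per-split differences of errors. This minimal version compares the paired Student statistic to a standard normal quantile, which is only an approximation of the Student quantile and is reasonable when the number of splits is large (250 for Tecator); this approximation and the function names are assumptions of the sketch.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Paired Student statistic computed on per-split error differences."""
    differences = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(differences)
    return statistics.mean(differences) / (
        statistics.stdev(differences) / math.sqrt(n))


def significantly_different(errors_a, errors_b, level=0.01):
    """Two-sided test at the given level, using a normal approximation."""
    threshold = statistics.NormalDist().inv_cdf(1.0 - level / 2.0)
    return abs(paired_t_statistic(errors_a, errors_b)) > threshold


# Toy usage with made-up errors on 40 splits (not results from the paper).
errors_a = [1.0, 1.2, 0.9, 1.1] * 10
errors_b = [e + 0.5 + 0.01 * (-1) ** i for i, e in enumerate(errors_a)]
errors_c = [e + 0.01 * (-1) ** i for i, e in enumerate(errors_a)]
```

For a small number of splits, the exact Student quantile with n - 1 degrees of freedom is larger than the normal one, so the approximation is slightly too permissive.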

The first conclusion of these experiments is that, even though the derivatives
are the relevant predictors, their performances are strongly affected by
the noise (compared to the ones of the original data: note that the average
M.S.E. reported in Table 1 are more than 10 times lower than the best ones from
Table 2 and that, in the best cases, $R^2$ is slightly greater than 50% for the
most noisy dataset). In particular, using interpolating splines or finite difference
derivatives leads to highly deteriorated performances. In this situation,
the approach proposed in the paper is particularly useful and helps to keep



[Figure 2 panel: Noise with sd = 0.01]

Figure 2: 10 observations of the noisy data generated from the Tecator spectra as in
Equation (9)



Noise      Data  Average M.S.E. and SD  Average R^2

sd = 0.01  O     13.3 (13.5)            93.5%
           S1    7.45 (1.5)             96.4%
           IS1   12.72 (2.2)            93.8%
           FD1   20.03 (2.8)            90.3%
           S2    6.83 (1.4)             96.7%
           IS2   31.23 (5.9)            84.9%
           FD2   31.10 (5.9)            84.9%

sd = 0.2   O     87.9 (13.9)            57.4%
           S1    85.0 (12.5)            58.8%
           IS1   210.1 (36.1)           -1.9%
           FD1   209.1 (33.0)           -1.4%
           S2    95.9 (12.8)            53.5%
           IS2   213.7 (33.1)           -3.6%
           FD2   235.1 (222.7)          -14.0%

Table 2: Summary of the performances of the chosen models on the test set (fat Tecator
regression problem) with noisy spectra.
[Figure 3 panel: Noise with sd = 0.01]

Figure 3: Mean squared errors boxplot for the noisy fat Tecator regression problem with
Gaussian kernel (the worst test samples for IS and FD have been removed for the sake of
clarity)



better performances than with the original data. Indeed, the difference between
the smoothing splines approach and the original data is still significant (for
both derivatives in the "small noise" case and for the first order derivative
in the "high noise" case), even though the noisier the data are, the
more difficult it is to estimate the derivatives in an accurate way. That is,
except for smoothing spline derivatives, the estimation of the derivatives for
the most noisy dataset is so bad that it leads to negative $R^2$ when used in
the regression task.

6.2.3. Fat content classification

In this section, the fat content regression problem is transformed into a
classification problem. To avoid imbalance in class sizes, the median value
of the fat in the dataset is used as the splitting criterion: the first class
consists of 119 samples with strictly less than 13.5 % of fat, while the second
class contains the other 121 samples with a fat content equal to or higher than
13.5 %.

As in previous sections, the analysis is conducted on 250 random splits of
the dataset into 160 learning spectra and 80 test spectra. We used stratified
sampling: the test set contains 40 examples from each class. The 4-fold
cross-validation used to select the parameters of the models on the learning
set is also stratified with roughly 20 examples of each class in each fold.
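The construction of the classification problem and of one stratified split can be sketched as below. The labels in {-1, 1}, the function names and the synthetic fat values are assumptions made for the illustration; only the median threshold, the "equal or higher" rule and the 40 test examples per class come from the text.

```python
import random
import statistics


def median_labels(fat):
    """Label +1 the samples with fat >= median and -1 the other ones."""
    threshold = statistics.median(fat)
    labels = [1 if value >= threshold else -1 for value in fat]
    return labels, threshold


def stratified_split(labels, n_test_per_class, rng):
    """Draw the same number of test indices from each class."""
    test = []
    for cls in sorted(set(labels)):
        members = [i for i, label in enumerate(labels) if label == cls]
        test.extend(rng.sample(members, n_test_per_class))
    test_set = set(test)
    train = [i for i in range(len(labels)) if i not in test_set]
    return train, sorted(test)


# Toy usage on 240 synthetic fat percentages (not the Tecator values).
rng = random.Random(0)
fat = [rng.uniform(1.0, 50.0) for _ in range(240)]
labels, threshold = median_labels(fat)
train_idx, test_idx = stratified_split(labels, 40, rng)
```

Repeating the last line with fresh random draws gives the successive random splits over which the results are averaged.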

The performance index is the mis-classification rate (MCR) on the test
set, reported in percentage and averaged over the 250 random splits. Results
are summarized in Table 3. As in the previous sections, both the model
and the preprocessing have some influence on the results. In particular,
using derivatives always improves the classification accuracy while the actual
method used to compute those derivatives has no particular influence on the
results. Additionally, using interpolation splines leads, in this particular
problem, to results that are exactly identical to the ones obtained with the
smoothing splines: they are not reported in Table 3.

More precisely, for the three models (linear SVM, Gaussian SVM and
KNN), the difference in mis-classification rates between the smoothing spline
preprocessing and the finite differences calculation is never significant,
according to a Student test with level 1 %. Additionally, while the actual
average mis-classification rates might seem quite different, the large variability of
the results (shown by the standard deviations) leads to significant differences
only for the most obvious cases. In particular, SVM models using derivatives
(of order one or two) are indistinguishable one from another using a Student
Method        Data  Average MCR  SD of MCR

Linear SVM    O     1.41         1.55
              S1    0.73         1.15
              FD1   0.74         1.15
              S2    0.94         1.27
              FD2   0.92         1.23

Gaussian SVM  O     3.39         2.57
              S1    0.97         1.41
              FD1   0.98         1.42
              S2    0.99         2.00
              FD2   0.97         1.27

KNN           O     22.0         5.02
              S1    6.67         2.55
              FD1   6.57         2.55
              S2    1.93         1.65
              FD2   1.93         1.63

Table 3: Summary of the performances of the chosen models on the test set (Tecator fat
classification problem). See Table 1 for notations. MCR stands for mis-classification rate,
SD for standard deviation.
test with level 1 %: all methods with less than 1 % of mean mis-classification
rate perform essentially identically. Other differences are significant: for
instance the linear SVM used on raw data performs significantly worse than
any SVM model used on derivatives.

It should be noted that the classification task studied in the present section
is obviously simpler than the regression task from which it is derived.
This explains the very good predictive performances obtained by simple models
such as a linear SVM, especially with the proper preprocessing.

6.3. Yellow-berry dataset

The goal of the last experiment is to predict the presence of yellow-berry in
durum wheat (Triticum durum) kernels via a near infrared spectral analysis
(see Figure 4). Yellow-berry is a defect of the durum wheat seeds that reduces
the quality of the flour produced from affected wheat. The traditional way
to assess the occurrence of yellow-berry is by visual analysis of a sample of
the seed stock. In the current application, a quality measure related to the
occurrence of yellow-berry is predicted from the spectrum of the seed.







Figure 4: 20 observations of NIR spectra of durum wheat 
The dataset consists of 953 spectra sampled at 1049 wavelengths uniformly
spaced in the range 400-2498 nm. The dataset is split randomly into
600 learning spectra and 353 test spectra. Compared to the Tecator
dataset, the variability of the results is smaller in the present case. We
therefore used 50 random splits rather than the 250 of the previous section.

The regression models were built via a Kernel Ridge Regression approach
using a linear kernel and a Gaussian kernel. In both cases, the regularization
parameter of the model is optimized by a leave-one-out approach. In addition,
the width parameter of the Gaussian kernel is optimized via the same
procedure at the same time.

The performance index is the mean squared error (M.S.E.). As a reference,
the target variable has a variance of 0.508. Results are summarized in
Table 4 and Figure 5.



Kernel and Data  Average M.S.E.  Standard deviation  Average R^2

Linear-O         0.122           8.77 10^-3          76.1%
Linear-S1        0.138           9.53 10^-3          73.0%
Linear-S2        0.122           8.41 10^-3          76.1%
Gaussian-O       0.110           20.2 10^-3          78.5%
Gaussian-S1      0.0978          7.92 10^-3          80.9%
Gaussian-S2      0.0944          8.35 10^-3          81.5%

Table 4: Summary of the performances of the chosen models on the test set (durum wheat
regression problem)

As in the previous section, we can rank the methods in increasing order
of modelling error. We obtain the following result:

G-S2 < G-S1 < G-O < L-O < L-S2 < L-S1,

where G stands for Gaussian kernel and L for linear kernel (hence G-S2 stands
for kernel ridge regression with Gaussian kernel and smoothing splines with
order 2 derivatives); < corresponds to a significant difference (for a paired
Student test with level 1%) and ≤ to a non significant one. For this application,
there is a significant gain in using a non linear model (the Gaussian
kernel). In addition, the use of derivatives leads to less contrasted performances
than the ones obtained in the previous section but it still improves
the quality of the non linear model in a significant way. In terms of normalized
mean squared error (mean squared error divided by the variance of the
Figure 5: Mean squared error boxplots for the "durum wheat" regression problem (see
Table 4 for the full names of the regression models)
target variable), using a non linear model with the second derivatives of the
spectra corresponds to an average gain of more than 5% (i.e., a reduction of
the normalised mean squared error from 24% for the standard linear model
to 18.6%).
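The arithmetic behind these percentages can be checked directly from the average M.S.E. values of Table 4 and from the variance of the target variable (0.508). The variable names below are illustrative.

```python
# Values reported in the text and in Table 4.
target_variance = 0.508
mse_linear_original = 0.122    # Linear-O
mse_gaussian_s2 = 0.0944       # Gaussian-S2

# Normalized M.S.E. = M.S.E. divided by the variance of the target.
nmse_linear = mse_linear_original / target_variance
nmse_gaussian_s2 = mse_gaussian_s2 / target_variance

# Gain, in points of normalized M.S.E.
gain = nmse_linear - nmse_gaussian_s2

print(f"{100 * nmse_linear:.1f}% -> {100 * nmse_gaussian_s2:.1f}% "
      f"(gain of {100 * gain:.1f} points)")
```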



7. Conclusion

In this paper we proposed a theoretical analysis of a common practice that
consists in using derivatives in classification or regression problems when the
predictors are curves. Our method relies on smoothing splines reconstruction
of the functions which are known only via a discrete deterministic sampling.
The method is proved to be consistent for very general classifiers or regression
schemes: it reaches asymptotically the best risk that could have been
obtained by constructing a regression/classification model on the true random
functions.

We have validated the approach by combining it with nonparametric regression
and classification algorithms to study two real-world spectrometric
datasets. The results obtained in these applications confirm once again that
relying on derivatives can improve the quality of predictive models compared
to a direct use of the sampled functions. The way the derivatives are estimated
does not have a strong impact on the performances except when the
data are noisy. In this case, the use of smoothing splines is quite relevant.

In the future, several issues could be addressed. An important practical
problem is the choice of the best order of the derivative, m. We consider
that a model selection approach relying on a penalized error loss could be
used, as is done in, e.g., Rossi and Villa (2006), to select the dimension of a
truncated basis representation for functional data. Note that in practice,
such a parameter selection method could lead to select m = 0 and therefore to
automatically exclude derivative calculation when it is not needed. This will
extend the application range of the proposed model.

A second important point to study is the convergence rate of the method.
It would be very convenient, for instance, to be able to relate the size of
the sampling grid to the number of functions. But this latter issue would
require the use of additional assumptions on the smoothness of the regression
function whereas the result presented in this paper, even if more limited, only
needs mild conditions.
9. Proofs

9.1. Theorem 1

In the original theorem (Lemma 3.1) in Kimeldorf and Wahba (1971),
one has to verify that $(k_0(t_l, \cdot))_l$ spans $\mathcal{H}_0^m$ and that $(k_1(t_l, \cdot))_l$ are linearly
independent. These are consequences of Assumption (A1).

First, $k_0(s,t) = \sum_{i,j=0}^{m-1} b_{ij}^{(-1)} s^i t^j$ where $B = (b_{ij}^{(-1)})_{i,j}$ is the inverse
of $\left(\sum_{l=1}^{m} B^l s^i \, B^l s^j\right)_{i,j}$ (see Heckman and Ramsay (2000)). Then

$$(k_0(t_1, s), \ldots, k_0(t_{|\tau_d|}, s)) = (1, s, \ldots, s^{m-1}) \, B \, [V_{m-1}(t_1, \ldots, t_{|\tau_d|})]^T$$

where $V_{m-1}(t_1, \ldots, t_{|\tau_d|})$ is the Vandermonde matrix with m − 1 columns and $|\tau_d|$
rows associated to values $t_1, \ldots, t_{|\tau_d|}$. If the $(t_l)_l$ are distinct, this matrix is
of full rank.

Moreover the reproducing property shows that $\sum_{l=1}^{|\tau_d|} a_l k_1(t_l, \cdot) = 0$ implies
$\sum_{l=1}^{|\tau_d|} a_l f(t_l) = 0$ for all $f \in \mathcal{H}_1^m$. Hence, $\mathcal{H}_1^m = \mathrm{Ker}\left(B, \sum_{l=1}^{|\tau_d|} a_l \zeta_l\right)$
where $\zeta_l$ denotes the linear form $h \in \mathcal{H}^m \mapsto h(t_l)$. As the co-dimension of
$\mathcal{H}_1^m$ is $\dim \mathcal{H}_0^m = m$ and as, by Assumption (A1), B is linearly independent
of $\sum_{l=1}^{|\tau_d|} a_l \zeta_l$, we thus have $\sum_{l=1}^{|\tau_d|} a_l \zeta_l = 0$ (or $\mathrm{codim}\,\mathrm{Ker}\left(B, \sum_{l} a_l \zeta_l\right) =
\dim \mathrm{Im}\left(B, \sum_{l} a_l \zeta_l\right)$ would be m + 1). Thus, we obtain that $\sum_{l=1}^{|\tau_d|} a_l f(t_l) = 0$
for all f in $\mathcal{H}^m$ and, as $(t_l)$ are distinct, that $a_l = 0$ for all l, leading to the
independence conclusion for the $(k_1(t_l, \cdot))_l$.

Finally, we prove that $S_{\lambda,\tau_d}$ is of full rank. Indeed, if $S_{\lambda,\tau_d} x^{\tau_d} = 0$, then
$\omega^T M_0 x^{\tau_d} = 0$ and $\eta^T M_1 x^{\tau_d} = 0$. As $(\omega_k)_k$ is a basis of $\mathcal{H}_0^m$, $\omega^T M_0 x^{\tau_d} = 0$
implies $M_0 x^{\tau_d} = 0$ and therefore $M_1 = (K_1 + \lambda I_d)^{-1}$. As shown above,
the $(k_1(t_l, \cdot))_l$ are linearly independent and therefore $\eta^T M_1 x^{\tau_d} = 0$ implies
$M_1 x^{\tau_d} = 0$, which in turn leads to $x^{\tau_d} = 0$ via the simplified formula for $M_1$.
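9.2. Corollary 1

We give only the proof for the classification case, the regression case is
identical.

According to Theorem 1, there is a full rank linear mapping from $\mathbb{R}^{|\tau_d|}$
to $\mathcal{H}^m$, $S_{\lambda,\tau_d}$, such that for any function $x \in \mathcal{H}^m$, $\hat{x}_{\lambda,\tau_d} = S_{\lambda,\tau_d} x^{\tau_d}$. Let
us denote $\mathcal{I}_{\lambda,\tau_d}$ the image of $\mathbb{R}^{|\tau_d|}$ by $S_{\lambda,\tau_d}$, $P_{\lambda,\tau_d}$ the orthogonal projection
from $\mathcal{H}^m$ to $\mathcal{I}_{\lambda,\tau_d}$ and $S_{\lambda,\tau_d}^{-1}$ the inverse of $S_{\lambda,\tau_d}$ on $\mathcal{I}_{\lambda,\tau_d}$. Obviously, we have
$S_{\lambda,\tau_d}^{-1} \circ P_{\lambda,\tau_d}(\hat{x}_{\lambda,\tau_d}) = x^{\tau_d}$.

Let $\psi$ be a measurable function from $\mathbb{R}^{|\tau_d|}$ to $\{-1, 1\}$. Then $\zeta_\psi$ defined
on $\mathcal{H}^m$ by $\zeta_\psi(u) = \psi\left(S_{\lambda,\tau_d}^{-1} \circ P_{\lambda,\tau_d}(u)\right)$ is a measurable function from
$\mathcal{H}^m$ to $\{-1, 1\}$ (because $S_{\lambda,\tau_d}^{-1}$ and $P_{\lambda,\tau_d}$ are both continuous). Then for
any measurable $\psi$, $\inf_{\phi: \mathcal{H}^m \to \{-1,1\}} P\left(\phi(\hat{X}_{\lambda,\tau_d}) \neq Y\right) \leq P\left(\zeta_\psi(\hat{X}_{\lambda,\tau_d}) \neq Y\right) =
P\left(\psi(X^{\tau_d}) \neq Y\right)$, and therefore

$$\inf_{\phi: \mathcal{H}^m \to \{-1,1\}} P\left(\phi(\hat{X}_{\lambda,\tau_d}) \neq Y\right) \leq \inf_{\phi: \mathbb{R}^{|\tau_d|} \to \{-1,1\}} P\left(\phi(X^{\tau_d}) \neq Y\right). \qquad (10)$$

Conversely, let $\psi$ be a measurable function from $\mathcal{H}^m$ to $\{-1, 1\}$. Then $\zeta_\psi$
defined on $\mathbb{R}^{|\tau_d|}$ by $\zeta_\psi(u) = \psi(S_{\lambda,\tau_d}(u))$ is measurable. Then for any measurable
$\psi$, $\inf_{\phi: \mathbb{R}^{|\tau_d|} \to \{-1,1\}} P\left(\phi(X^{\tau_d}) \neq Y\right) \leq P\left(\zeta_\psi(X^{\tau_d}) \neq Y\right) = P\left(\psi(\hat{X}_{\lambda,\tau_d}) \neq Y\right)$
and therefore

$$\inf_{\phi: \mathbb{R}^{|\tau_d|} \to \{-1,1\}} P\left(\phi(X^{\tau_d}) \neq Y\right) \leq \inf_{\phi: \mathcal{H}^m \to \{-1,1\}} P\left(\phi(\hat{X}_{\lambda,\tau_d}) \neq Y\right). \qquad (11)$$

The combination of equations (10) and (11) gives equality (4).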

9.3. Corollary 2

1. Suppose assumption (A4a) is fulfilled

The proof is based on Theorem 1 in Faragó and Györfi (1975). This
theorem relates the Bayes risk of a classification problem based on
$(X, Y)$ with the Bayes risk of the problem $(T_d(X), Y)$ where $(T_d)$ is a
series of transformations on X.

More formally, for a pair of random variables $(X, Y)$, where X takes
values in $\mathcal{X}$, an arbitrary metric space, and Y in $\{-1, 1\}$, let us
denote, for any series of functions $T_d$ from $\mathcal{X}$ to itself, $L^*(T_d) =
\inf_{\phi: \mathcal{X} \to \{-1,1\}} P\left(\phi(T_d(X)) \neq Y\right)$. Theorem 1 from Faragó and Györfi
(1975) states that $E\left(\delta(T_d(X), X)\right) \xrightarrow{d \to +\infty} 0$ implies $L^*(T_d) \xrightarrow{d \to +\infty} L^*$,
where $\delta$ denotes the metric on $\mathcal{X}$.

This can be applied to $\mathcal{X} = (\mathcal{H}^m, \langle \cdot, \cdot \rangle_{L^2})$ with $T_d(X) =
\hat{X}_{\lambda_d,\tau_d} = S_{\lambda_d,\tau_d} X^{\tau_d}$: under Assumptions (A1) and (A2), Theorem 2
gives:

$$\|T_d(X) - X\|_{L^2}^2 \leq \left(A_{R,m} \lambda_d + B_{R,m} \frac{1}{|\tau_d|^{2m}}\right) \|D^m X\|_{L^2}^2.$$

Taking the expectation of both sides gives

$$E\left(\|T_d(X) - X\|_{L^2}^2\right) \leq \left(A_{R,m} \lambda_d + B_{R,m} \frac{1}{|\tau_d|^{2m}}\right) E\left(\|D^m X\|_{L^2}^2\right),$$

using the fact that the constants are independent of the function under analysis.
Then under Assumptions (A4a) and (A3), $E\left(\|T_d(X) - X\|_{L^2}\right) \xrightarrow{d \to +\infty} 0$. According to
Faragó and Györfi (1975), this implies $\lim_{d \to \infty} L_d^* = L^*$.



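2. Suppose assumption (A4b) is fulfilled

The conclusion will follow both for the classification case and for the regression
case. The proof follows the general ideas of Biau et al. (2005);
Rossi and Conan-Guez (2006); Rossi and Villa (2006); Laloë (2008).
Under assumption (A1), by Theorem 1 and with an argument similar to
those developed in the proof of Corollary 1, $\sigma(\hat{X}_{\lambda_d,\tau_d}) = \sigma(\{X(t)\}_{t \in \tau_d})$.
From assumption (A4b), $\left(\sigma(\{X(t)\}_{t \in \tau_d})\right)_d$ is clearly a filtration. Moreover,
as $E(Y)$ and thus $E(Y^2)$ are finite, $E\left(Y \mid \hat{X}_{\lambda_d,\tau_d}\right)$ is a uniformly
bounded martingale for this filtration (see Lemma 35 of Pollard (2002)).
This martingale converges in $L^1$-norm to $E\left(Y \mid \sigma\left(\cup_d \sigma(\hat{X}_{\lambda_d,\tau_d})\right)\right)$; we
have

- $\sigma\left(\cup_d \sigma(\hat{X}_{\lambda_d,\tau_d})\right) \subset \sigma(X)$ as $\hat{X}_{\lambda_d,\tau_d}$ is a function of X (via Theorem 1);

- by Theorem 2, $\hat{X}_{\lambda_d,\tau_d} \xrightarrow{d \to +\infty} X$ surely in $L^2$, which proves that X
is $\sigma\left(\cup_d \sigma(\hat{X}_{\lambda_d,\tau_d})\right)$-measurable.

Finally, $E\left(Y \mid \sigma\left(\cup_d \sigma(\hat{X}_{\lambda_d,\tau_d})\right)\right) = E(Y \mid X)$ and
$E\left(Y \mid \hat{X}_{\lambda_d,\tau_d}\right) \xrightarrow{d \to +\infty} E(Y \mid X)$ in $L^1$.

The conclusion follows from the fact that:

(a) binary classification case: the bound $L_d^* - L^* \leq
2 E\left|E\left(Y \mid \hat{X}_{\lambda_d,\tau_d}\right) - E(Y \mid X)\right|$ (see Theorem 2.2 of
Devroye et al. (1996)) concludes the proof;

(b) regression case: as $E(Y^2)$ is finite, $E\left(E\left(Y \mid \hat{X}_{\lambda_d,\tau_d}\right)^2\right)$ is also finite
and the convergence also happens for the quadratic norm (see
Corollary 6.22 in Kallenberg (1997)), i.e.,

$$\lim_{d \to +\infty} E\left(\left[E(Y \mid X) - E\left(Y \mid \hat{X}_{\lambda_d,\tau_d}\right)\right]^2\right) = 0.$$

Hence, as $L_d^* - L^* = E\left(\left[E(Y \mid X) - E\left(Y \mid \hat{X}_{\lambda_d,\tau_d}\right)\right]^2\right)$, the conclusion follows.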

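9.4. Theorem 3

We have

$$L(\phi_{n,\tau_d}) - L^* = L(\phi_{n,\tau_d}) - L_d^* + L_d^* - L^*. \qquad (12)$$

Let $\epsilon$ be a positive real. By Corollary 2, there exists $d_0 \in \mathbb{N}^*$ such that, for all
$d \geq d_0$,

$$L_d^* - L^* \leq \epsilon. \qquad (13)$$

Moreover, as shown in Corollary 1 and as $Q_{\lambda_d,\tau_d}$ is invertible, we have
in the binary classification case:

$$L_d^* = \inf_{\phi: \mathbb{R}^{|\tau_d|} \to \{-1,1\}} P\left(\phi(X^{\tau_d}) \neq Y\right) = \inf_{\phi: \mathbb{R}^{|\tau_d|} \to \{-1,1\}} P\left(\phi(Q_{\lambda_d,\tau_d} X^{\tau_d}) \neq Y\right),$$

and in the regression case:

$$L_d^* = \inf_{\phi: \mathbb{R}^{|\tau_d|} \to \mathbb{R}} E\left([\phi(X^{\tau_d}) - Y]^2\right) = \inf_{\phi: \mathbb{R}^{|\tau_d|} \to \mathbb{R}} E\left([\phi(Q_{\lambda_d,\tau_d} X^{\tau_d}) - Y]^2\right).$$

By hypothesis, for any fixed d, $\phi_{n,\tau_d}$ is consistent, that is

$$\lim_{n \to +\infty} E\left(L(\phi_{n,\tau_d})\right) = \inf_{\phi: \mathbb{R}^{|\tau_d|} \to \{-1,1\}} P\left(\phi(Q_{\lambda_d,\tau_d} X^{\tau_d}) \neq Y\right)$$

in the classification case and

$$\lim_{n \to +\infty} E\left(L(\phi_{n,\tau_d})\right) = \inf_{\phi: \mathbb{R}^{|\tau_d|} \to \mathbb{R}} E\left([\phi(Q_{\lambda_d,\tau_d} X^{\tau_d}) - Y]^2\right)$$

in the regression case, and therefore for any fixed $d_0$,
$\lim_{n \to +\infty} E\left(L(\phi_{n,\tau_{d_0}})\right) = L_{d_0}^*$. Combined with equations (12) and
(13), this concludes the proof.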